Main notebook for the Traffic Management Challenge

This is the main notebook containing the highlights, analysis and code for the Traffic Management problem statement.

This notebook is broken down into the following sections:

  1. Answering and addressing the main problem statements
  2. Our process of dealing with the problem

Done by:

Answering and addressing the main problem statements

Problem Statement 1: Which areas have high / low traffic demand?

The rough breakdown of the total region provided in the dataset when looked at in real-world terms is as follows:

Originally, we assumed that the dataset provided would be from Singapore. However, the range of latitude and longitude values in the dataset covers an area larger than the total area of Singapore.

As such, we can probably assume that the area provided either belongs to a larger country or spans multiple countries in Southeast Asia. Regardless, as we were informed that the geohash data provided would be offset, it is more beneficial to focus on the data itself than to try to place it on the world map.

Furthermore, as the total area is quite large, we decided it would be best to first look at the demand per region broadly. As such, in order to answer this problem statement, we identified two regions each of relatively high and low demand.

regionBreakdown.PNG

If we follow the color bar legend at the side, we can see that the brightly colored grids are regions with relatively high demand whereas the dimmer grids are regions with relatively lower demand.

The entirely black grids refer to areas which either have no data points or have little to no demand at all. If the area does not have a data point, it is because the geohash of this area was not provided in the original dataset.

On the other hand, if the area has little to no demand at all but the geohash was provided in the dataset, it is likely that this geohash zone was missing a large majority of its data and thus, it appears to have almost no demand at all in this HexPlot as we filled in the missing rows with the appropriate data for all other columns but with 0 demand.

Regardless of the reason, for the purposes of this analysis, it is better to focus on the regions with obviously high or low demand rather than on the areas with little to no demand, as the analysis may be less accurate for those areas.

Regions with obviously higher demand (estimated lat and lon range):

Region 1 -> ( -5.30 to -5.40 ), ( 90.60 to 90.75 )
    * 11.1 km by 16.65 km
    * 184.815 square km
Region 2 -> ( -5.33 to -5.40 ), ( 90.75 to 90.85 )
    * 7.77 km by 11.1 km
    * 86.247 square km


Regions with relatively lower demand (estimated lat and lon range):

Small region below region 1 -> ( -5.40 to -5.50 ), ( 90.65 to 90.75 )
    * 11.1 km by 11.1 km
    * 123.21 square km
Region to the right of region 2 -> ( -5.30 to -5.40 ), ( 90.87 to 90.95 )
    * 11.1 km by 8.88 km
    * 98.568 square km

These sizes take 1 degree (lat and lon) == 111 km, which is an approximation.
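The sizes listed for each region can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: the helper name `region_size_km` and the `regions` dictionary are ours, and it applies the same flat 111 km per degree approximation to both latitude and longitude.

```python
# Flat approximation used in this notebook: 1 degree == 111 km for both axes.
KM_PER_DEGREE = 111.0


def region_size_km(lat_range, lon_range):
    """Return (height_km, width_km, area_km2) of a lat/lon bounding box."""
    height_km = abs(lat_range[1] - lat_range[0]) * KM_PER_DEGREE
    width_km = abs(lon_range[1] - lon_range[0]) * KM_PER_DEGREE
    return height_km, width_km, height_km * width_km


# Estimated bounding boxes, copied from the text.
regions = {
    "Region 1": ((-5.30, -5.40), (90.60, 90.75)),
    "Region 2": ((-5.33, -5.40), (90.75, 90.85)),
    "Below region 1": ((-5.40, -5.50), (90.65, 90.75)),
    "Right of region 2": ((-5.30, -5.40), (90.87, 90.95)),
}

for name, (lat_range, lon_range) in regions.items():
    height, width, area = region_size_km(lat_range, lon_range)
    print(f"{name}: {height:.2f} km by {width:.2f} km, {area:.3f} square km")
```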

We can also take a closer look at each region which allows us to see the differences in the general spread of demand within the region.

REGION 1

region1_general.PNG

REGION 2

region2_general.PNG

REGION 3

region3_general.PNG

REGION 4

region4_general.PNG

Generally, each region we identified has a smaller sub-area in which the demand is the greatest and the grids leading up to those sub-areas have a smaller demand as the distance to the sub-area increases.

Currently, we have two hypotheses.

  1. Regions of relatively high demand are Central Business District (CBD) areas whereas regions of relatively low demand are residential areas.
  2. The sub-area within each region, whether of high or low demand, is a CBD area.

Problem Statement 2: How does regional traffic demand change according to day / time?

In order to answer this question, we made use of the regions of high and low demand that we identified earlier.

We also identified certain main trends that we wished to compare based on what we imagined to be true going off our experience with changes in traffic in Singapore.

Namely, we examined:

Overall

First, we can take a look at the trend throughout all 61 days.

generalRegion.PNG

We can see a rough 'V' shape pattern in the trend over all 61 days.

The traffic demand seems to generally decrease until around the 25th day, after which it starts to increase.

aggDemandOverTime.PNG

To generate the graph above, we took the mean of the demand every hour for the high and low demand regions respectively. As such, each point represents the mean demand of all the geohashes in that region at that hour across all days.
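A minimal sketch of this hourly aggregation is shown below. It runs on a tiny made-up table, and the column names `region_type`, `day`, `hour` and `demand` are assumptions for illustration rather than the names used in our actual DataFrame.

```python
import pandas as pd

# Toy stand-in for the prepared dataset (column names are assumptions).
df = pd.DataFrame({
    "region_type": ["high", "high", "high", "high", "low", "low", "low", "low"],
    "day":         [1, 2, 1, 2, 1, 2, 1, 2],
    "hour":        [0, 0, 1, 1, 0, 0, 1, 1],
    "demand":      [0.4, 0.6, 0.8, 0.6, 0.1, 0.3, 0.2, 0.2],
})

# Mean demand per hour, pooled over all days and geohashes of each region type.
hourly_mean = (
    df.groupby(["region_type", "hour"])["demand"]
      .mean()
      .unstack("region_type")  # one column per region type, one row per hour
)
print(hourly_mean)
```

Each column of `hourly_mean` can then be drawn as one line of the plot, with the hour on the x-axis.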

Generally, both high and low demand regions follow the same trend, with demand increasing from the early morning until around the end of the peak hour period. After that, traffic demand starts to decrease, more sharply for high demand regions than for low demand regions, before increasing again at around 1900 to 2000 hours and into the next day.

This does more or less support our theory that high demand regions are CBD areas, whereby there is a significantly smaller traffic demand in the afternoons when everyone is at work especially when we take into account the sharper decline in traffic demand as compared to low demand regions.

On the other hand, the regions identified as relatively low demand are likely to be residential areas. They follow the same pattern throughout the day but with a smoother curve, likely because closer to home, people usually walk, use their personal vehicles or stay at home, which also accounts for the generally lower demand.

However, neither of these theories can explain why the traffic demand always increases after around 2000 hours. The most likely explanation we could come up with is that the traffic demand increases after 2000 hours because some individuals work night shifts.

As such, because these individuals are a minority, we can see the traffic demand increasing after 2000 hours but it never reaches the peak traffic demand as in the morning.

Another possible reason that could account for this increase in traffic demand after 2000 hours is that these regions contain entertainment facilities which could explain why the traffic demand goes back up as most people leave to go back home. However, this does not explain why the traffic demand also follows this same trend during the weekdays when most people would still have to go to work or school in the mornings.

High Demand Regions

In general, the spread of traffic demand seems to increase during the weekends for high demand regions.

This does further support the theory that high demand regions are CBD areas as CBD areas usually have a number of middle to high-end restaurants which may be what contributes to the larger spread of traffic demand during the weekends as families head there to eat.

REGION 1: Breakdown of Demand over all 61 days

region1_demandZones.PNG

REGION 2: Breakdown of Demand over all 61 days

region2_demandZones.PNG

Interestingly, while the trend of the demand over all 61 days seems more or less the same for both the overall region and only the sub-area with large demand in region 1, this does not hold true for region 2.

In region 2, while the general trend does seem to be similar, it does not fit as nicely as in region 1. Additionally, the difference between the average demand for the overall region and for just the high demand sub-area is surprisingly large.

REGION 1: Breakdown of demand per week

region1_weeklyBreakdown.PNG

REGION 2: Breakdown of demand per week

region2_weeklyBreakdown.PNG

Generally, there is not much change in the demand for region 1 from week to week. However, there is a very large spike in traffic demand in week 8 (days 50 to 56), which suggests that week 8 could be part of a school holiday period.

This is further supported by the fact that week 9 also has a spike in traffic demand, albeit a slightly smaller one.

However, region 2 seems to not show this same trend as the traffic demand in week 8 for region 2 is not the outlier with the highest traffic demand but is instead one of the weeks with an average traffic demand.

Furthermore, the traffic demand in region 2 fluctuates quite a bit from week to week, unlike in region 1 where the average demand seems to be maintained each week with the exception of week 8.

For this region, week 6 seems to have the highest traffic demand. In fact, the traffic demand seems to rise steadily from week 2 until week 6, after which it starts to decrease.

REGION 1: Weekday vs Weekend

region1_weekday_weekend.PNG

REGION 2: Weekday vs Weekend

region2_weekend_weekday.PNG

The general trend of the traffic demand over the course of the day remains similar enough that if not for the slight decrease in traffic demand, which only occurs in the mornings, the two line plots could be superimposed onto each other.

However, contrary to our expectations, the traffic demand does not actually increase during the weekends. In fact, it almost seems as if the traffic demand is lower during the weekends as compared to weekdays.

This may be due to the fact that fewer people are rushing during the weekends and thus, they either walk or take public transport which would account for the smaller traffic demand during weekends.

This could also be due to the fact that in SEA, traffic is not as high during weekends, as people do not really have a reason to move around.

High Demand Regions (Region 1 and 2): Weekly breakdown

highDemand_weeklyBreakdown.PNG

High Demand Regions (Region 1 and 2): Weekday vs Weekend

highDemand_weekday_weekend.PNG

Low Demand Regions

In contrast to high demand regions, the spread of traffic demand doesn't really change much for low demand regions which does not fully support the theory that low demand regions are residential areas as we would expect individuals to move around the region more seeing as they are not stuck in school or in offices.

However, this constant spread in traffic demand could be explained by the fact that movements around a residential area would not change much whether it is a weekday or a weekend as on both types of days, individuals would leave their homes, but for different reasons.

REGION 3: Breakdown of demand per week

region3_weeklyBreakdown.PNG

Interestingly, this pattern of demand over the course of a single day is not similar to that of the pattern for regions with high demand.

While the traffic demand does also increase from the start of the day, it actually drops between about 0400 to 0800 which also encompasses the start of the usual morning peak hour period.

However, the traffic demand does start rising at around 0900 hours until about 1100 hours, when the demand starts dropping drastically, hitting around 0 demand after 1500 hours.

Of course, this is highly unlikely to reflect the real traffic demand over the course of the day. The most likely explanation is that the data for the region after 1500 hours was missing in the original dataset provided and the 0 demand shown is a result of us filling in the missing rows in the dataset.

As such, while we can likely still perform some analysis on this region, it may be better for us to focus on the mornings as well as to not use this region as the general baseline for all regions with low demand.

In fact, this region could very well actually have relatively high demand if the demand during the afternoon, after 1500 hours, follows the same trend as that of the other high demand regions we have seen.

REGION 4: Breakdown of demand per week

region4_weeklyBreakdown.PNG

This pattern seems to match up with that of the pattern that regions with relatively high demand seem to have, namely the increasing traffic demand throughout the morning followed by a sharp decrease until the evenings and a continuous increase throughout and into the next day.

However, the pattern in the morning does not match the morning pattern for region 3. While region 3 has a minor fluctuation in traffic demand in the morning, this region fluctuates greatly throughout the morning as it approaches and passes the morning peak hour period.

In fact, the only time where the traffic demand remains roughly smooth is after 1500 hours where it follows a general 'U' shape.

Interestingly, the traffic demand in the mornings changes greatly from week to week. However, it smooths back into the same general curve shape and demand after 1500 hours, regardless of the week.

REGION 3: Weekday vs Weekend

region3_weekend_weekday.PNG

REGION 4: Weekday vs Weekend

region4_weekday_weekend.PNG

Similar to regions with high demand, the pattern of traffic demand is similar for both weekdays and weekends.

In fact, there is almost no difference aside from the slight decrease in the traffic demand in the mornings which was also observed in regions with high demand.

This lends more credence to the theory that this region could in fact be classified as a region of high demand if not for the missing rows of data after 1500 hours, which decreased the mean demand for the region enough to make it seem like a region with relatively low demand compared to the other regions.

On the other hand, while there are some differences between region 3 and region 4, thus far, region 4 actually has more similarities to the regions with high demand than region 3, especially in regards to its general trend of traffic demand over the course of a single day.

At this point, we cannot conclude with a 100% certainty that region 3 is actually a region with high demand just based on what we have seen from region 4.

Low Demand Regions (Region 3 and 4 mean): Weekly breakdown

lowDemand_weeklyBreakdown.PNG

In general, low demand regions seem to have similar traffic demands in the second half of the day, after 1500 hours, regardless of what week it is. However, in direct contrast, the traffic demand in the morning fluctuates greatly depending on the week.

Low Demand Regions (Region 3 and 4): Weekday vs Weekend

lowDemand_weekday_weekend.PNG

Additionally, as with regions with high demand, the pattern for weekdays versus weekends is the same.

The general trend of the traffic demand over the course of the day remains similar enough that if not for the slight decrease in traffic demand, which only occurs in the mornings, the two line plots could be superimposed onto each other.

In conclusion:

From this we can see:

Problem Statement 3: Forecast the travel demand for next 15min / 1hour and predict areas with high travel demand

Getting Region 1 data

Data preparation (Sliding Window)

We use the data from the one geohash in region 1 with the highest mean demand to train the regression models, since we are predicting high demand.

Note that the model is trained on every timestamp of each of the 61 days.

Let's use the sliding window approach to restructure the time series into a supervised learning problem.

The reason for this is so that we can use traditional ML models to do the forecasting.

For example, the sliding window converts a time series that looks like this:

Time    Measure
1       100
2       110
3       108
4       115
5       120

and restructures it into this:

X (feature)    y (label)
?              100
100            110
110            108
108            115
115            120
120            ?

The feature, X, is the measure at the previous time step and the label, y, is the measure at the next time step.

Using the sliding window, the order of the time series demand data is preserved, while allowing for traditional supervised learning models to be used for time series

In our case, we use a sliding window step of 1, since we are forecasting the next 15 minutes.

Note that we remove the first and last rows of the transformed data, because the previous time step is unknown for the first row and the next time step is unknown for the last row.
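The transformation can be sketched with pandas on the small example series from the table above. This is an illustration under our own naming (`series`, `frame`, `supervised`), not necessarily the exact code used in the notebook.

```python
import pandas as pd

# The example series from the table above.
series = pd.Series([100, 110, 108, 115, 120], name="y")

# X is the value at the previous time step, y the value at the current one.
frame = pd.DataFrame({"X": series.shift(1), "y": series})

# The first row has no previous value (NaN), so it is dropped.
supervised = frame.dropna().reset_index(drop=True)
print(supervised)
```

Because `shift(1)` only looks backwards, the trailing row with an unknown label is never created, so dropping the first row leaves the same four (X, y) pairs as removing the first and last rows of the table.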

Model Training

Due to the no free lunch theorem, we would have to try an array of regressors to determine which is best for the problem.

As such, we would be using:

All the models seem to be good predictors, with training and test scores above 0.90. However, the difference between the training and test scores was smallest for the GBR, which suggests that this model generalises best. Hence, we use it as our final model.
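The comparison of training and test scores can be sketched as follows. Everything here is illustrative: the demand series is synthetic, the three candidate regressors are stand-ins chosen by us (of these, only the GBR is named in this notebook, read here as scikit-learn's `GradientBoostingRegressor`), the chronological 80/20 split is an assumption, and `score` is scikit-learn's default R² for regressors.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Synthetic demand with a daily cycle (96 fifteen-minute slots) plus noise.
rng = np.random.default_rng(0)
t = np.arange(600)
demand = 0.5 + 0.4 * np.sin(2 * np.pi * t / 96) + rng.normal(0, 0.02, t.size)

# Sliding window of step 1: previous demand -> next demand.
X, y = demand[:-1].reshape(-1, 1), demand[1:]

# Chronological split (assumed): first 80% for training, rest for testing.
split = int(0.8 * len(X))
X_train, X_test, y_train, y_test = X[:split], X[split:], y[:split], y[split:]

candidates = {
    "LinearRegression": LinearRegression(),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=50, random_state=0),
    "GradientBoostingRegressor": GradientBoostingRegressor(random_state=0),
}

results = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    train_score = model.score(X_train, y_train)  # R^2 on the training data
    test_score = model.score(X_test, y_test)     # R^2 on the held-out data
    results[name] = (train_score, test_score, train_score - test_score)
    print(f"{name}: train={train_score:.3f} test={test_score:.3f}")

# Smallest train/test gap, the selection criterion described above.
best = min(results, key=lambda name: abs(results[name][2]))
print("Smallest gap:", best)
```

On synthetic data the model with the smallest gap may differ from the one chosen on the real dataset; the sketch only shows how the comparison is made.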

Let's see the predictions of demands made by the GBR

From the graphs, the GBR's forecast follows the actual demand over time almost exactly.

Let's test the GBR out by giving it the demand of another geohash

Again, the GBR seems to forecast the time series of this geohash accurately. Let's see how the model fares on a geohash with lower demand.

Now we can see that the model does not forecast low demand geohashes as accurately, as seen from the forecast demand being significantly lower than the region's actual low demand.

Evaluation

Let's use some metrics to evaluate the performance of the model.

First, we would use RMSE

The RMSE of 0.054 on the training set and 0.057 on the test set is very small compared to the standard deviation of the demand of the geohash, which is 0.25.

This shows that the predicted demands do not deviate much from the actual demands, and hence that the model forecasts demand accurately with few errors.
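For reference, the RMSE and the standard deviation it is compared against can be computed as in the sketch below. The arrays are made-up values, not the notebook's data, and the helper name `rmse` is ours.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Made-up actual and predicted demands, each prediction off by 0.05.
y_true = np.array([0.10, 0.40, 0.80, 0.60, 0.20])
y_pred = np.array([0.15, 0.35, 0.75, 0.65, 0.25])

error = rmse(y_true, y_pred)
spread = float(np.std(y_true))  # standard deviation of the actual demand
print(f"RMSE={error:.3f}, std={spread:.3f}")
```

An RMSE well below the standard deviation means the model does much better than always predicting the mean demand.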

Second, let's see the residual plot

Taking into account that the demand is a value between 0 and 1, the residual plot above shows that the GBR model's prediction errors are quite minimal.

How to forecast the travel demand

To forecast the travel demand, the GBR model is given the demand at the current time and outputs the demand for the next 15 minutes.

The demand given (input) must be a 2-dimensional numpy array.

The output is a numpy array containing the predicted demand for the next 15 minutes.
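A minimal sketch of the input and output shapes is shown below. The model here is a small stand-in trained on a synthetic series, assuming the GBR is scikit-learn's `GradientBoostingRegressor`; the variable names are ours.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Stand-in model trained on a synthetic daily cycle (not the real data).
t = np.arange(400)
demand = 0.5 + 0.4 * np.sin(2 * np.pi * t / 96)
model = GradientBoostingRegressor(n_estimators=50, random_state=0)
model.fit(demand[:-1].reshape(-1, 1), demand[1:])

# The current demand must be wrapped into shape (n_samples, n_features).
current_demand = 0.42
model_input = np.array([[current_demand]])  # shape (1, 1)

forecast = model.predict(model_input)       # numpy array of shape (1,)
next_demand = float(forecast[0])            # demand for the next 15 minutes
print(model_input.shape, forecast.shape, round(next_demand, 3))
```

Passing a bare float or a 1-dimensional array instead raises an error in scikit-learn, which is why the extra pair of brackets (or `reshape(-1, 1)`) is needed.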

In conclusion:

Our process of dealing with the problem

Before we started on the problem statements, there was some feature engineering and data manipulation that we had to do.

Namely:

  1. Filling in missing rows of data
  2. Adding a latitude and longitude column
  3. Separating the timestamp into hours and minutes
  4. Based on the day, obtaining whether the data point is a weekday or a weekend as well as the week number whereby week 1 is days 1 to 7.

Filling in missing rows of data

The .csv file provided contains only 4,206,321 rows of data.

However, there should theoretically be 7,782,624 rows of data ( 1329 unique geohashes * 61 days * 24 hours * 4 fifteen-minute intervals per hour ).

As instructed by the teachers-in-charge, we shall treat all the missing rows of data as data whereby the demand at that point in time is 0. Thus, as there is a difference between having no rows of data and having data whereby the demand is 0, we shall move to fill in the missing rows of data.

In order to fill in the data, we will first create an empty Pandas DataFrame with the appropriate formatting and columns before slotting in the data from the training.csv file into the empty DataFrame.
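The sketch below shows one way to get the same result on a toy table. Instead of slotting rows into an empty DataFrame, it builds the full index of every combination and reindexes onto it, which is a different technique with an equivalent outcome. The column names `geohash6`, `day`, `timestamp` and `demand`, and the sample values, are assumptions for illustration.

```python
import pandas as pd

# Expected size of the full grid, as computed in the text.
expected_rows = 1329 * 61 * 24 * 4

# Toy observed data: 3 rows out of a possible 2 geohashes * 2 days * 2 slots.
observed = pd.DataFrame({
    "geohash6": ["qp03wc", "qp03wc", "qp03pn"],
    "day": [1, 1, 2],
    "timestamp": ["0:0", "0:15", "0:0"],
    "demand": [0.2, 0.5, 0.1],
})

geohashes = observed["geohash6"].unique()
days = [1, 2]                # full data: days 1 to 61
timestamps = ["0:0", "0:15"]  # full data: 96 slots per day

# Every (geohash, day, timestamp) combination that should exist.
full_index = pd.MultiIndex.from_product(
    [geohashes, days, timestamps], names=["geohash6", "day", "timestamp"]
)

# Rows missing from the observed data are created with a demand of 0.
filled = (
    observed.set_index(["geohash6", "day", "timestamp"])
            .reindex(full_index, fill_value=0.0)
            .reset_index()
)
print(len(filled), "rows after filling")
```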

Adding a latitude and longitude column
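A geohash can be decoded into the latitude and longitude of its cell centre with the standard bisection algorithm. The sketch below is a self-contained illustration with a function name of our choosing, not necessarily the library or code used in this notebook, and it is checked against the widely used reference geohash `ezs42` rather than against our (offset) data.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet


def decode_geohash(geohash):
    """Return the (latitude, longitude) of the centre of a geohash cell."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    is_lon = True  # bits alternate, starting with longitude
    for char in geohash:
        value = BASE32.index(char)
        for shift in range(4, -1, -1):  # 5 bits per character, MSB first
            bit = (value >> shift) & 1
            if is_lon:
                mid = (lon_lo + lon_hi) / 2
                if bit:
                    lon_lo = mid
                else:
                    lon_hi = mid
            else:
                mid = (lat_lo + lat_hi) / 2
                if bit:
                    lat_lo = mid
                else:
                    lat_hi = mid
            is_lon = not is_lon
    return (lat_lo + lat_hi) / 2, (lon_lo + lon_hi) / 2


lat, lon = decode_geohash("ezs42")
print(round(lat, 3), round(lon, 3))
```

Applying such a function to the geohash column of a DataFrame gives the two new columns.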

Separating the timestamp into hours and minutes
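A minimal sketch of this step is shown below. It assumes the timestamp is stored as a string of the form "hour:minute" (for example "20:15"), which is an assumption about the file format, and the helper name is ours.

```python
def split_timestamp(timestamp):
    """Split an 'hour:minute' string into integer hour and minute."""
    hour_text, minute_text = timestamp.split(":")
    return int(hour_text), int(minute_text)


hour, minute = split_timestamp("20:15")
print(hour, minute)
```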

Based on the day, obtaining whether the data point is a weekday or a weekend as well as the week number whereby week 1 is days 1 to 7.
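Both values follow from simple integer arithmetic on the day number, as sketched below. The week number matches the definition above (week 1 is days 1 to 7). The weekend flag depends on which day of the week day 1 falls on, which the dataset does not state; the default of Monday used here is purely an assumption, and both function names are ours.

```python
def week_number(day):
    """Week 1 is days 1 to 7, week 2 is days 8 to 14, and so on."""
    return (day - 1) // 7 + 1


def is_weekend(day, first_day_weekday=0):
    """Return True if the day falls on a Saturday or Sunday.

    first_day_weekday is the weekday of day 1 (0 = Monday ... 6 = Sunday).
    The default of Monday is an assumption, not a fact from the dataset.
    """
    weekday = (first_day_weekday + day - 1) % 7
    return weekday >= 5


for day in (1, 6, 7, 8, 50, 61):
    print(day, week_number(day), is_weekend(day))
```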